Add b300-cw (CoreWeave B300) runner launch script and pool #1730
Open
JordanNanos wants to merge 2 commits into
New CoreWeave B300 cluster: 5 nodes of 8x B300, Slurm partition b300, shared storage on /mnt/vast. Single-node launcher adapted from launch_h200-cw.sh (same CoreWeave salloc + enroot/pyxis pattern) with the framework-tagged benchmark-script selection from launch_b300-nv.sh. Multi-node is not wired up yet and exits with a clear error. Registers pool key b300-cw with one runner (b300-cw_0), following the gb300-cw naming convention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
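The "exits with a clear error" behaviour for multi-node requests could look roughly like the guard below. This is a minimal sketch, not the code in `launch_b300-cw.sh`: the function name `require_single_node` and the way the node count reaches the launcher are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical guard: refuse anything other than a single-node request.
# How the real launcher learns the node count is not stated in this PR;
# here it is simply passed as the first argument (default 1).
require_single_node() {
  local nodes="${1:-1}"
  if (( nodes > 1 )); then
    echo "launch_b300-cw.sh: multi-node is not wired up yet (requested ${nodes} nodes)" >&2
    return 1
  fi
}
```

A launcher would call something like `require_single_node "$NODE_COUNT" || exit 1` before touching `salloc`, so an unsupported request fails fast without holding an allocation (`NODE_COUNT` is a placeholder name).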
Replace the initial /mnt/vast (shared NFS) import with the launch_b200-cw.sh node-local /tmp pattern: import the container on the allocated worker under flock and pass the squash path as-is. Avoids the enroot aufs-whiteout failures root-squash NFS triggers (documented in launch_b300-nv.sh), and matches the launcher exercised by the b300-cw smoke test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
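The node-local import under `flock` described in this commit can be sketched as below. It is an illustration under stated assumptions rather than the launcher's actual code: `import_once`, `do_import`, and the `.lock` suffix are hypothetical names, and in the real flow this logic would run on the allocated worker (for example via `srun`), not on the login host.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Default import step. Assumes enroot is installed on the worker; registry
# hosts or credentials would need extra handling that is not shown here.
do_import() {
  local image="$1" squash="$2"
  enroot import -o "$squash" "docker://${image}"
}

# Hypothetical helper: import an image to a node-local squash file at most
# once, holding an exclusive flock so concurrent jobs on the node do not race.
import_once() {
  local image="$1" squash="$2"
  local lock="${squash}.lock"
  (
    flock -x 9                       # wait for the exclusive lock on fd 9
    if [[ -s "$squash" ]]; then      # a previous job already imported it
      echo "reused ${squash}"
    else
      do_import "$image" "$squash" >&2
      echo "imported ${squash}"
    fi
  ) 9>"$lock"
}
```

Because the lock file sits next to the squash file on the worker's `/tmp`, two jobs landing on the same node serialize on the import, and the second one finds the file already present and skips the download.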
Summary

Adds the launcher and runner-pool entry for the new CoreWeave B300 cluster (`b300-cw`), so its login-node runner(s) can pick up B300 single-node benchmark jobs through Slurm.

The cluster is 5 nodes × 8× B300 (x86_64) on Slurm partition `b300`; compute nodes have enroot only (no Docker). The runner lives on the login node with `_work` on shared NFS; the launcher `salloc`s a node and `srun`s the benchmark into it via pyxis.

Changes

New launcher (`runners/launch_b300-cw.sh`):

- Follows the `launch_b200-cw.sh` CoreWeave template: `salloc` one node, import the container to that node's local `/tmp` under `flock` (serializes concurrent imports), then `srun --container-image` in the same allocation, passing the squash path as-is (it lives on the worker's `/tmp`, not visible from the login host).
- Importing to `/tmp` rather than shared `/mnt/vast` NFS avoids the enroot `aufs-whiteout` failures that root-squash NFS triggers (documented in `launch_b300-nv.sh`).
- Benchmark-script selection tries the framework-tagged name first (`<model>_<prec>_b300_<fw>.sh`), then the legacy bare/`_trt` fallback.

Runner pool (`.github/configs/runners.yaml`):

- New `b300-cw` key listing the registered runner `b300-cw_0`. Kept as its own pool (not folded into `b300`) so CoreWeave jobs stay separate from the NVIDIA B300 fleet.

Validation

- Smoke run on the `b300-cw` runner: the job matched the `b300-cw` label, the launcher's `salloc` granted a node, and `srun` ran the container import + benchmark on it. (Smoke run was off a separate branch based on the agentx-v0.4 agentic harness; this PR is the runner-plumbing half.)
- `bash -n` clean.

Notes for reviewers

- The runner is named `b300-cw_0` (single digit), matching the `gb300-cw` convention; targeting is by the shared `b300-cw` label, so the exact name only matters for `run-sweep.yml` distribution.
- Runner labels are `slurm`, `b300-cw`, not bare `b300`, which would make it eligible for NVIDIA-fleet B300 jobs.

🤖 Generated with Claude Code
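For reviewers who want to reason about the benchmark-script selection order described under Changes, here is a minimal sketch of that lookup. It is not copied from `launch_b300-cw.sh`: the function name `select_benchmark_script` and its arguments are hypothetical, and the exact legacy naming (a bare `<model>_<prec>_b300.sh`, or a `_trt` suffix when the framework is `trt`) is an assumption about what "legacy bare/`_trt` fallback" means.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the benchmark script to run for a given
# model / precision / framework, or fail if none exists.
# Order: framework-tagged name first, then the legacy name.
select_benchmark_script() {
  local dir="$1" model="$2" prec="$3" fw="$4"
  local tagged="${dir}/${model}_${prec}_b300_${fw}.sh"
  # Assumed legacy naming: bare, or with a _trt suffix for the trt framework.
  local legacy="${dir}/${model}_${prec}_b300.sh"
  if [[ "$fw" == "trt" ]]; then
    legacy="${dir}/${model}_${prec}_b300_trt.sh"
  fi

  local candidate
  for candidate in "$tagged" "$legacy"; do
    if [[ -f "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no benchmark script for ${model} ${prec} ${fw} in ${dir}" >&2
  return 1
}
```

Trying the tagged name first means a framework-specific script, once added, takes precedence without anyone having to delete the legacy file.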